Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LightDM failsafe handler and restart directives #340

Merged

Conversation

wallentx
Copy link
Contributor

@wallentx wallentx commented Jan 4, 2024

This PR is in reference to #337

In a few different scenarios, I've experienced situations where a child process spawned by LightDM (such as x11) fails to start. As per the instructions in lightdm.service, upon failure, LightDM will Restart=always.
The problem that can come from this, is when one of those services are failing for a very specific reason that requires the user to make changes. In a scenario where such a service will fail 100% of the time, this will cause LightDM to Restart=always 100% of the time, when will eventually trigger systemd's rate-limiting mechanism, resulting in logs that would appear like this:

systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped LightDM.                        
systemd[1]: lightdm.service: Start request repeated too quickly.
systemd[1]: Failed to start LightDM.                
systemd[1]: lightdm.service: Unit entered failed state.       
systemd[1]: lightdm.service: Failed with result 'exit-code'. 

The problem is that, depending on the failure, the user might only see a black screen, with no indication that anything is wrong, except that the screen is black, and nothing is loading. The user would only be able to discover details in the aforementioned logs if they were to switch to a different TTY, log in, and check their logs. From here, they would be able to discover why the failure happened, and fix the issue.

The changes that I am proposing are essentially a shortcut to what I just described.

The lightdm.service file now includes StartLimit IntervalSec and Burst values, a RestartSec rate, and an OnFailure action. The values I've proposed I think make sense:
The service has a restart delay of 5 seconds in between resets, and if the service restarts more than 5 times within a 60s period, then the service will be in a failed state. In the event of it being in a failed state, it triggers the OnFailure action:debian/lightdm-failure-handler.service

This service makes a copy of /etc/issue (the MOTD-like message you see above a TTY login) to /etc/issue.restore, and writes in a new message "hint" to /etc/issue that lets the user know what to do. It then changes the user to TTY4 (I only chose 4 because I've noticed others to be reserved for other things, or unavailable.. someone should check my reasoning here though) so that they can see that message, and log in to troubleshoot.

Upon successful login of LightDM, if an /etc/issue.restore file exists, it will replace /etc/issue (which presumably has the "hint" we wrote earlier). If no .restore file is found, no action is taken.

I couldn't find anything else in the repo specifically that referenced the packaging or manifest of the .service files, but maybe that is handled elsewhere.

Copy link

github-actions bot commented Jan 4, 2024

Everyone contributing to this PR have now signed the CLA. Thanks!

@wallentx
Copy link
Contributor Author

wallentx commented Jan 4, 2024

Hey! wallentx has not signed the Canonical CLA which is required to get this contribution merged on this project.
Please head over to ubuntu.com/legal/contributors to read more about it.

I signed this, btw, but maybe my launchpad ID was supposed to be my launchpad account email?

@robert-ancell
Copy link
Collaborator

Can you please resolve the above conflicts? Thanks.

@robert-ancell
Copy link
Collaborator

Sorry, @wallentx needs a rebase again.

@iphands
Copy link

iphands commented Mar 28, 2024

This is great @wallentx I want this too!

Its rather maddening when you get caught in a restart loop and as you are at some tty, the restart occurs and in the middle of typing you get flipped back to tty1 as things fail.

Having some back off / retry limit would be great ootb config!

@wallentx wallentx force-pushed the wallentx/lightdm-failsafe-handler branch from 939c34d to 0c3b360 Compare April 6, 2024 05:31
Copy link

github-actions bot commented Apr 6, 2024

Hey! wallentx has not signed the Canonical CLA which is required to get this contribution merged on this project.

Please head over to https://ubuntu.com/legal/contributors to read more about it.

@wallentx
Copy link
Contributor Author

wallentx commented Jun 7, 2024

Hey @robert-ancell I've been held up on attempting to fix my Canonical CLA issue because I found a bug/exploit/bypass with the Canonical cla-check itself, and have a PR that shows it's existence, and also includes the fix canonical/has-signed-canonical-cla#38

If I fix my CLA, then the evidence of the flaw disappears. Do you know anyone who might be able to at least ack the details of my other PR so that I can be freed up to move forward here?

@robert-ancell robert-ancell merged commit f50ba25 into canonical:main Jun 10, 2024
1 check failed
@robert-ancell
Copy link
Collaborator

@wallentx thanks for working on this! I've pinged the person who seems to be maintaining the CLA checker and hopefully they'll land that fix soon :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants